The goal of the midterm is to apply some of the methods for supervised and unsupervised analysis to a new dataset. We will work with data characterizing the relationship between wine quality and its analytical characteristics, available at the UCI ML repository as well as on this course website on canvas. The overall goal will be to use data modeling approaches to understand which wine properties most influence wine quality as determined by expert evaluation. The output variable assigns wine to discrete categories between 0 (the worst) and 10 (the best), so this problem can be formulated as classification or regression – here we will stick to the latter and model the outcome as a continuous variable. For more details please see the dataset description available at UCI ML or the corresponding file on this course website on canvas. Please note that there is another, much smaller, dataset on UCI ML also characterizing wine in terms of its analytical properties – make sure to use the correct URL as shown above, or, to eliminate any possibility of ambiguity, the data available on the course website in canvas – the correct dataset contains several thousand observations. For simplicity, clarity, and to decrease your dependency on network reliability and UCI ML availability, you are advised to download the data made available on this course website to a local folder and work with that local copy.
There are two compilations of data available under the URL shown above as well as on the course website in canvas – separate ones for red and for white wine. Please develop models of wine quality for each of them, investigate attributes deemed important for wine quality in both, and determine whether the quality of red and white wine is influenced predominantly by the same or by different analytical properties (i.e. predictors in these datasets). Lastly, as an exercise in unsupervised learning, you will be asked to combine the analytical data for red and white wine and describe the structure of the resulting data – whether there are any well-defined clusters, what subsets of observations they appear to represent, which attributes appear to affect this structure the most, etc.
Finally, as you will notice, the instructions here are terser than in the previous homework assignments. We expect you to use what you’ve learned in the class to complete the analysis and draw appropriate conclusions based on the data. All approaches that you are expected to apply here have been exercised in the preceding weekly assignments – please feel free to consult your submissions and/or official solutions as to how they were applied to different datasets. As always, if something appears to be unclear, please ask questions – we may change to private mode those that, in our opinion, reveal too many details.
Download and read in the data, produce numerical and graphical summaries of the dataset attributes, decide whether they can be used for modeling in untransformed form or whether any transformations are justified, and comment on the correlation structure and whether some of the predictors suggest a relationship with the outcome.
Red Wine:
#Load red.wine data
red.wine = read.csv(file="winequality-red.csv",header=TRUE,sep=";")
#Dimensions of data
dim(red.wine)
## [1] 1599 12
#List first 6 rows with headers
head(red.wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
#Summarizing data
summary(red.wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
#Boxplotting each attribute
old.par <- par(mar=c(1,1,1,1),mfrow=c(1,12),ps=6)
for(i in colnames(red.wine)){
boxplot(red.wine[,i],main=i)
}
par(old.par)
#Pair/Plot all 12 attributes of untransformed data
pairs(red.wine,pch=19,cex=red.wine$quality^3/1000, col=hsv(0, 1, 1-red.wine$quality/10))
#Pair/Plot all 12 attributes of log-transformed data
pairs(red.wine+1,pch=19,cex=red.wine$quality^3/1000,log="xy", col=hsv(0, 1, 1-red.wine$quality/10))
#Outlier detection
#plot(red.wine[,c(7,1)])
#identify(red.wine[,c(7,1)])
#old.par<-par(mfrow=c(2,2))
#plot(lm(red.wine$quality~red.wine$fixed.acidity+red.wine$total.sulfur.dioxide, red.wine))
#Rows 1080 and 1082 are outliers, as can be seen in the plots of "total.sulfur.dioxide", but they do not influence the model red.wine$quality~red.wine[,1]+red.wine[,7] significantly
#Checking if there are zeros or NAs in the data
#length(red.wine[red.wine == 0])
#length(which(is.na(red.wine)))
#Check if log-transformation improves the data
old.par <- par(mar=c(1,1,1,1), mfrow=c(6,4),ps=6)
for ( col.n in colnames(red.wine)[1:6] ) {
hist(red.wine[,col.n],col='brown1',xlab=col.n, main=col.n)
qqnorm(red.wine[,col.n])
qqline(red.wine[,col.n], col="red")
hist(log(red.wine[,col.n]+1),col='lightblue',xlab=col.n, main=col.n)
qqnorm(log(red.wine[,col.n]+1))
qqline(log(red.wine[,col.n]+1), col="red")
}
par(old.par)
#Next 6 attributes
old.par <- par(mar=c(1,1,1,1), mfrow=c(6,4),ps=6)
for ( col.n in colnames(red.wine)[7:12] ) {
hist(red.wine[,col.n],col='brown1',xlab=col.n, main=col.n)
qqnorm(red.wine[,col.n])
qqline(red.wine[,col.n], col="red")
hist(log(red.wine[,col.n]+1),col='lightblue',xlab=col.n, main=col.n)
qqnorm(log(red.wine[,col.n]+1))
qqline(log(red.wine[,col.n]+1), col="red")
}
par(old.par)
Log-transformation improves all data attributes except “citric.acid” and “quality”, so log-transformation is justified for the other attributes. For the “density” and “pH” attributes the log-transformation has no significant effect.
#Log-transforming all except "citric.acid" and "quality"
red.wine.log = red.wine
for(cn in colnames(red.wine.log[,!names(red.wine.log) %in% c("citric.acid","quality")])){
red.wine.log[,cn] = log(red.wine[,cn]+1)
#colnames(red.wine.log)[names(red.wine.log) == cn] = paste0("L.",cn)
}
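As a numeric complement to the Q-Q plots above, the effect of the transform can also be quantified via sample skewness (values closer to 0 indicate a more symmetric distribution). This is a minimal base-R sketch; `skewness` here is our own helper, not a package function:

```r
#Sample skewness: mean of cubed standardized values
skewness <- function(x) {
z <- (x - mean(x)) / sd(x)
mean(z^3)
}
#Compare raw vs log(x+1) skewness for the 11 analytical attributes
signif(rbind(raw = sapply(red.wine[,1:11], skewness),
             log = sapply(red.wine[,1:11], function(x) skewness(log(x+1)))), 2)
```

Attributes whose skewness moves markedly toward 0 in the `log` row are the ones for which the transform is most justified.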
#Pair/Plot all 12 attributes of log-transformed data
#pairs(red.wine.log[,-grep("quality",colnames(red.wine.log))],cex=.2,pch=19,col=red.wine.log$quality*3)
#Correlation matrix
red.wine.cr=cor(red.wine,use="complete")
signif(red.wine.cr,3)
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.0000 -0.25600 0.6720
## volatile.acidity -0.2560 1.00000 -0.5520
## citric.acid 0.6720 -0.55200 1.0000
## residual.sugar 0.1150 0.00192 0.1440
## chlorides 0.0937 0.06130 0.2040
## free.sulfur.dioxide -0.1540 -0.01050 -0.0610
## total.sulfur.dioxide -0.1130 0.07650 0.0355
## density 0.6680 0.02200 0.3650
## pH -0.6830 0.23500 -0.5420
## sulphates 0.1830 -0.26100 0.3130
## alcohol -0.0617 -0.20200 0.1100
## quality 0.1240 -0.39100 0.2260
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.11500 0.09370 -0.15400
## volatile.acidity 0.00192 0.06130 -0.01050
## citric.acid 0.14400 0.20400 -0.06100
## residual.sugar 1.00000 0.05560 0.18700
## chlorides 0.05560 1.00000 0.00556
## free.sulfur.dioxide 0.18700 0.00556 1.00000
## total.sulfur.dioxide 0.20300 0.04740 0.66800
## density 0.35500 0.20100 -0.02190
## pH -0.08570 -0.26500 0.07040
## sulphates 0.00553 0.37100 0.05170
## alcohol 0.04210 -0.22100 -0.06940
## quality 0.01370 -0.12900 -0.05070
## total.sulfur.dioxide density pH sulphates
## fixed.acidity -0.1130 0.6680 -0.6830 0.18300
## volatile.acidity 0.0765 0.0220 0.2350 -0.26100
## citric.acid 0.0355 0.3650 -0.5420 0.31300
## residual.sugar 0.2030 0.3550 -0.0857 0.00553
## chlorides 0.0474 0.2010 -0.2650 0.37100
## free.sulfur.dioxide 0.6680 -0.0219 0.0704 0.05170
## total.sulfur.dioxide 1.0000 0.0713 -0.0665 0.04290
## density 0.0713 1.0000 -0.3420 0.14900
## pH -0.0665 -0.3420 1.0000 -0.19700
## sulphates 0.0429 0.1490 -0.1970 1.00000
## alcohol -0.2060 -0.4960 0.2060 0.09360
## quality -0.1850 -0.1750 -0.0577 0.25100
## alcohol quality
## fixed.acidity -0.0617 0.1240
## volatile.acidity -0.2020 -0.3910
## citric.acid 0.1100 0.2260
## residual.sugar 0.0421 0.0137
## chlorides -0.2210 -0.1290
## free.sulfur.dioxide -0.0694 -0.0507
## total.sulfur.dioxide -0.2060 -0.1850
## density -0.4960 -0.1750
## pH 0.2060 -0.0577
## sulphates 0.0936 0.2510
## alcohol 1.0000 0.4760
## quality 0.4760 1.0000
#corrPairs - is correlation visualization function(modified code from lectures)
corrPairs(red.wine.cr)
#Correlation matrix
red.wine.log.cr=cor(red.wine.log,use="complete")
signif(red.wine.log.cr,3)
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.000 -0.2610 0.66900
## volatile.acidity -0.261 1.0000 -0.56400
## citric.acid 0.669 -0.5640 1.00000
## residual.sugar 0.159 0.0242 0.16800
## chlorides 0.120 0.0726 0.20200
## free.sulfur.dioxide -0.178 0.0207 -0.08780
## total.sulfur.dioxide -0.114 0.0841 -0.00255
## density 0.674 0.0300 0.36500
## pH -0.704 0.2320 -0.54400
## sulphates 0.191 -0.2830 0.32400
## alcohol -0.090 -0.2140 0.10900
## quality 0.116 -0.3930 0.22600
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.1590 0.12000 -0.17800
## volatile.acidity 0.0242 0.07260 0.02070
## citric.acid 0.1680 0.20200 -0.08780
## residual.sugar 1.0000 0.05590 0.10000
## chlorides 0.0559 1.00000 -0.00557
## free.sulfur.dioxide 0.1000 -0.00557 1.00000
## total.sulfur.dioxide 0.1540 0.06220 0.78400
## density 0.4060 0.21900 -0.03960
## pH -0.0896 -0.27300 0.09580
## sulphates 0.0156 0.33800 0.05530
## alcohol 0.0751 -0.23600 -0.08320
## quality 0.0217 -0.13500 -0.05030
## total.sulfur.dioxide density pH sulphates
## fixed.acidity -0.11400 0.6740 -0.7040 0.1910
## volatile.acidity 0.08410 0.0300 0.2320 -0.2830
## citric.acid -0.00255 0.3650 -0.5440 0.3240
## residual.sugar 0.15400 0.4060 -0.0896 0.0156
## chlorides 0.06220 0.2190 -0.2730 0.3380
## free.sulfur.dioxide 0.78400 -0.0396 0.0958 0.0553
## total.sulfur.dioxide 1.00000 0.1040 -0.0171 0.0593
## density 0.10400 1.0000 -0.3410 0.1570
## pH -0.01710 -0.3410 1.0000 -0.1840
## sulphates 0.05930 0.1570 -0.1840 1.0000
## alcohol -0.23700 -0.4920 0.2030 0.1150
## quality -0.17100 -0.1750 -0.0576 0.2810
## alcohol quality
## fixed.acidity -0.0900 0.1160
## volatile.acidity -0.2140 -0.3930
## citric.acid 0.1090 0.2260
## residual.sugar 0.0751 0.0217
## chlorides -0.2360 -0.1350
## free.sulfur.dioxide -0.0832 -0.0503
## total.sulfur.dioxide -0.2370 -0.1710
## density -0.4920 -0.1750
## pH 0.2030 -0.0576
## sulphates 0.1150 0.2810
## alcohol 1.0000 0.4770
## quality 0.4770 1.0000
#Correlation visualization of log-transformed data
corrPairs(red.wine.log.cr)
#Finding collinearity
vif(lm(quality~.,data=red.wine.log))
## fixed.acidity volatile.acidity citric.acid
## 8.176442 1.832439 3.091536
## residual.sugar chlorides free.sulfur.dioxide
## 1.897230 1.447292 2.875551
## total.sulfur.dioxide density pH
## 3.161360 7.110926 3.546283
## sulphates alcohol
## 1.456824 3.187275
There is a moderate negative correlation (-0.39) between “quality” and “volatile.acidity”, and a positive correlation (0.48) between “quality” and “alcohol”.
There are significant correlations between “fixed.acidity” and “citric.acid” (0.67), “density” (0.67), and “pH” (-0.70). Also, “total.sulfur.dioxide” and “free.sulfur.dioxide” are correlated (0.78).
All VIF values are below 10, so collinearity is acceptable.
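As a cross-check of `vif()` (a sketch assuming the `red.wine.log` data frame built above), the VIF of a single predictor can be computed by hand as 1/(1-R²), where R² comes from regressing that predictor on all the other predictors:

```r
#Manual VIF for fixed.acidity: regress it on the remaining predictors
r2 <- summary(lm(fixed.acidity ~ . - quality, data = red.wine.log))$r.squared
1 / (1 - r2)   #should reproduce the vif() value for fixed.acidity shown above
```

This makes explicit why a high VIF flags collinearity: it means the predictor is largely explained by the others.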
White Wine:
#Load white.wine data
white.wine = read.csv(file="winequality-white.csv",header=TRUE,sep=";")
#Dimensions of data
dim(white.wine)
## [1] 4898 12
#List first 6 rows with headers
head(white.wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.0 0.27 0.36 20.7 0.045
## 2 6.3 0.30 0.34 1.6 0.049
## 3 8.1 0.28 0.40 6.9 0.050
## 4 7.2 0.23 0.32 8.5 0.058
## 5 7.2 0.23 0.32 8.5 0.058
## 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
#Summarizing data
summary(white.wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
#Boxplotting each attribute
old.par <- par(mar=c(1,1,1,1),mfrow=c(1,12),ps=6)
for(i in colnames(white.wine)){
boxplot(white.wine[,i],main=i)
}
par(old.par)
#Pair/Plot all 12 attributes of untransformed data
pairs(white.wine,pch=19,cex=white.wine$quality^3/1000, col=hsv(50/360, 1, 1-white.wine$quality/10))
#pairs(white.wine,pch=19,cex=.3, col=white.wine$quality)
#Pair/Plot all 12 attributes of log-transformed data
#pairs(white.wine+1,pch=19,cex=.3,log="xy", col=white.wine$quality)
pairs(white.wine+1,pch=19,cex=white.wine$quality^3/1000,log="xy", col=hsv(50/360, 1, 1-white.wine$quality/10))
#Outlier detection
#plot(white.wine[,c(5,1)])
#identify(white.wine[,c(5,1)])
#old.par<-par(mfrow=c(2,2))
#plot(lm(white.wine$quality~white.wine$density+white.wine$chlorides, white.wine))
#Row 2782 is an outlier, but it does not have a large effect on the model white.wine$quality~white.wine[,1]+white.wine[,5]
#Pair/Plot all 12 attributes of log-transformed data without the high-density outliers
#(cex and col must be subset along with the data so their lengths match the plotted rows)
keep = white.wine$density<1.0102
pairs(white.wine[keep,]+1,pch=19,cex=white.wine$quality[keep]^3/1000,log="xy", col=hsv(50/360, 1, 1-white.wine$quality[keep]/10))
#Checking if there are zeros or NAs in the data
#length(white.wine[white.wine == 0])
#length(which(is.na(white.wine)))
#Check if log-transformation improves the data
old.par <- par(mar=c(1,1,1,1), mfrow=c(6,4),ps=6)
for ( col.n in colnames(white.wine)[1:6] ) {
hist(white.wine[,col.n],col='darkgoldenrod1',xlab=col.n, main=col.n)
qqnorm(white.wine[,col.n])
qqline(white.wine[,col.n], col="red")
hist(log(white.wine[,col.n]+1),col='lightblue',xlab=col.n, main=col.n)
qqnorm(log(white.wine[,col.n]+1))
qqline(log(white.wine[,col.n]+1), col="red")
}
par(old.par)
#Next 6 attributes
old.par <- par(mar=c(1,1,1,1), mfrow=c(6,4),ps=6)
for ( col.n in colnames(white.wine)[7:12] ) {
hist(white.wine[,col.n],col='darkgoldenrod1',xlab=col.n, main=col.n)
qqnorm(white.wine[,col.n])
qqline(white.wine[,col.n], col="red")
hist(log(white.wine[,col.n]+1),col='lightblue',xlab=col.n, main=col.n)
qqnorm(log(white.wine[,col.n]+1))
qqline(log(white.wine[,col.n]+1), col="red")
}
par(old.par)
Log-transformation improves all data attributes except “total.sulfur.dioxide” and “quality”. Log-transformation is justified for other attributes.
#Log-transforming all attributes except "total.sulfur.dioxide" and "quality"
white.wine.log = white.wine
for(cn in colnames(white.wine.log[,!names(white.wine.log) %in% c("total.sulfur.dioxide","quality")])){
white.wine.log[,cn] = log(white.wine[,cn]+1)
#colnames(white.wine.log)[names(white.wine.log) == cn] = paste0("L.",cn)
}
#Correlation matrix
white.wine.cr=cor(white.wine,use="complete")
signif(white.wine.cr,3)
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.0000 -0.0227 0.28900
## volatile.acidity -0.0227 1.0000 -0.14900
## citric.acid 0.2890 -0.1490 1.00000
## residual.sugar 0.0890 0.0643 0.09420
## chlorides 0.0231 0.0705 0.11400
## free.sulfur.dioxide -0.0494 -0.0970 0.09410
## total.sulfur.dioxide 0.0911 0.0893 0.12100
## density 0.2650 0.0271 0.15000
## pH -0.4260 -0.0319 -0.16400
## sulphates -0.0171 -0.0357 0.06230
## alcohol -0.1210 0.0677 -0.07570
## quality -0.1140 -0.1950 -0.00921
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.0890 0.0231 -0.049400
## volatile.acidity 0.0643 0.0705 -0.097000
## citric.acid 0.0942 0.1140 0.094100
## residual.sugar 1.0000 0.0887 0.299000
## chlorides 0.0887 1.0000 0.101000
## free.sulfur.dioxide 0.2990 0.1010 1.000000
## total.sulfur.dioxide 0.4010 0.1990 0.616000
## density 0.8390 0.2570 0.294000
## pH -0.1940 -0.0904 -0.000618
## sulphates -0.0267 0.0168 0.059200
## alcohol -0.4510 -0.3600 -0.250000
## quality -0.0976 -0.2100 0.008160
## total.sulfur.dioxide density pH sulphates
## fixed.acidity 0.09110 0.2650 -0.426000 -0.0171
## volatile.acidity 0.08930 0.0271 -0.031900 -0.0357
## citric.acid 0.12100 0.1500 -0.164000 0.0623
## residual.sugar 0.40100 0.8390 -0.194000 -0.0267
## chlorides 0.19900 0.2570 -0.090400 0.0168
## free.sulfur.dioxide 0.61600 0.2940 -0.000618 0.0592
## total.sulfur.dioxide 1.00000 0.5300 0.002320 0.1350
## density 0.53000 1.0000 -0.093600 0.0745
## pH 0.00232 -0.0936 1.000000 0.1560
## sulphates 0.13500 0.0745 0.156000 1.0000
## alcohol -0.44900 -0.7800 0.121000 -0.0174
## quality -0.17500 -0.3070 0.099400 0.0537
## alcohol quality
## fixed.acidity -0.1210 -0.11400
## volatile.acidity 0.0677 -0.19500
## citric.acid -0.0757 -0.00921
## residual.sugar -0.4510 -0.09760
## chlorides -0.3600 -0.21000
## free.sulfur.dioxide -0.2500 0.00816
## total.sulfur.dioxide -0.4490 -0.17500
## density -0.7800 -0.30700
## pH 0.1210 0.09940
## sulphates -0.0174 0.05370
## alcohol 1.0000 0.43600
## quality 0.4360 1.00000
#Correlation matrix visualized
corrPairs(white.wine.cr)
#Correlation matrix of log-transformed data
white.wine.log.cr=cor(white.wine.log,use="complete")
signif(white.wine.log.cr,3)
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.0000 -0.0306 0.30400
## volatile.acidity -0.0306 1.0000 -0.17100
## citric.acid 0.3040 -0.1710 1.00000
## residual.sugar 0.0874 0.09 0.07100
## chlorides 0.0341 0.0682 0.10700
## free.sulfur.dioxide -0.0465 -0.1130 0.08690
## total.sulfur.dioxide 0.1010 0.0967 0.12000
## density 0.2760 0.0253 0.14600
## pH -0.4350 -0.0346 -0.16600
## sulphates -0.0153 -0.0373 0.06720
## alcohol -0.1250 0.0577 -0.06890
## quality -0.1100 -0.1960 0.00599
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.0874 0.0341 -0.0465
## volatile.acidity 0.09 0.0682 -0.1130
## citric.acid 0.0710 0.1070 0.0869
## residual.sugar 1.0000 0.0836 0.3260
## chlorides 0.0836 1.0000 0.0957
## free.sulfur.dioxide 0.3260 0.0957 1.0000
## total.sulfur.dioxide 0.4210 0.2070 0.6000
## density 0.7780 0.2690 0.2850
## pH -0.1840 -0.0906 0.0217
## sulphates -0.0324 0.0237 0.0631
## alcohol -0.4250 -0.3750 -0.2310
## quality -0.0753 -0.2160 0.0949
## total.sulfur.dioxide density pH sulphates
## fixed.acidity 0.10100 0.2760 -0.43500 -0.0153
## volatile.acidity 0.09670 0.0253 -0.03460 -0.0373
## citric.acid 0.12000 0.1460 -0.16600 0.0672
## residual.sugar 0.42100 0.7780 -0.18400 -0.0324
## chlorides 0.20700 0.2690 -0.09060 0.0237
## free.sulfur.dioxide 0.60000 0.2850 0.02170 0.0631
## total.sulfur.dioxide 1.00000 0.5300 0.00347 0.1430
## density 0.53000 1.0000 -0.09480 0.0823
## pH 0.00347 -0.0948 1.00000 0.1580
## sulphates 0.14300 0.0823 0.15800 1.0000
## alcohol -0.45400 -0.7860 0.12900 -0.0252
## quality -0.17500 -0.3070 0.09940 0.0484
## alcohol quality
## fixed.acidity -0.1250 -0.11000
## volatile.acidity 0.0577 -0.19600
## citric.acid -0.0689 0.00599
## residual.sugar -0.4250 -0.07530
## chlorides -0.3750 -0.21600
## free.sulfur.dioxide -0.2310 0.09490
## total.sulfur.dioxide -0.4540 -0.17500
## density -0.7860 -0.30700
## pH 0.1290 0.09940
## sulphates -0.0252 0.04840
## alcohol 1.0000 0.43200
## quality 0.4320 1.00000
corrPairs(white.wine.log.cr)
There is a moderate negative correlation (-0.31) between “quality” and “density” and a positive correlation (0.43) between “quality” and “alcohol”.
There is a strong negative correlation between “alcohol” and “density” (-0.79), and weaker negative correlations between “alcohol” and “residual.sugar” (-0.43) and between “alcohol” and “total.sulfur.dioxide” (-0.45). There is a strong positive correlation between “residual.sugar” and “density” (0.78), and a positive correlation between “total.sulfur.dioxide” and “free.sulfur.dioxide” (0.60).
Because of the strong correlations of “density” with “residual.sugar” (0.78) and “alcohol” (-0.79), skipping the “density” attribute should help avoid collinearity problems.
#Finding collinearity
vif(lm(quality~.,data=white.wine.log))
## fixed.acidity volatile.acidity citric.acid
## 1.965851 1.159768 1.183259
## residual.sugar chlorides free.sulfur.dioxide
## 5.246982 1.222915 1.751950
## total.sulfur.dioxide density pH
## 2.176463 12.033690 1.690311
## sulphates alcohol
## 1.103493 4.883773
The VIF value for “density” is greater than 10, so the “density” attribute will be excluded.
corrPairs(cor(white.wine.log,use="complete"))
#Excluding "density" attribute
white.wine.log = subset(white.wine.log, select=-c(density))
#Checking collinearity
vif(lm(quality~.,data=white.wine.log))
## fixed.acidity volatile.acidity citric.acid
## 1.375078 1.158206 1.175535
## residual.sugar chlorides free.sulfur.dioxide
## 1.440990 1.215371 1.727452
## total.sulfur.dioxide pH sulphates
## 2.152719 1.340710 1.062249
## alcohol
## 1.638456
After excluding “density”, all remaining VIF values are below 2.2.
Use regsubsets from library leaps to choose optimal set of variables for modeling wine quality for red and white wine (separately), describe differences and similarities between attributes deemed important in each case.
Red Wine:
#Finding values for metrics("rsq","rss","adjr2","cp","bic") with three methods ("exhaustive", "backward", "forward") for different number of attributes in model
summaryMetrics <- NULL
whichAll <- list()
regsubsetsAll <- list()
for ( myMthd in c("exhaustive", "backward", "forward") ) {
rsRes <- regsubsets(quality~.,red.wine.log,method=myMthd,nvmax=11,really.big = T)
regsubsetsAll[[myMthd]] <- rsRes
summRes <- summary(rsRes)
whichAll[[myMthd]] <- summRes$which
for ( metricName in c("rsq","rss","adjr2","cp","bic") ) {
summaryMetrics <- rbind(summaryMetrics,
data.frame(method=myMthd,metric=metricName,
nvars=1:length(summRes[[metricName]]),
value=summRes[[metricName]]))
}
}
#Plotting results
ggplot(summaryMetrics,aes(x=nvars,y=value,shape=method,colour=method)) + geom_path() + geom_point() + facet_wrap(~metric,scales="free") + theme(legend.position="top")
All three variable selection methods, when applied to the entire dataset, yield models with very similar fit metrics. For all of them except BIC and CP, increasing the number of variables appears to result in progressive improvement of the fit. BIC reaches its minimum with 7 out of 11 variables in the model, and CP with 9. Adding the 2nd and 3rd variables improves the model significantly; up to the 7th variable the improvement is slight.
Using 6 variables:
#Exhaustive method
red.wine.log.set6 = names(whichAll$exhaustive[6,whichAll$exhaustive[6,]==TRUE])[-1]
red.wine.log.set6
## [1] "volatile.acidity" "chlorides" "total.sulfur.dioxide"
## [4] "pH" "sulphates" "alcohol"
#Backward method
names(whichAll$backward[6,whichAll$backward[6,]==TRUE])[-1]
## [1] "volatile.acidity" "chlorides" "total.sulfur.dioxide"
## [4] "pH" "sulphates" "alcohol"
#Forward method
names(whichAll$forward[6,whichAll$forward[6,]==TRUE])[-1]
## [1] "volatile.acidity" "chlorides" "total.sulfur.dioxide"
## [4] "pH" "sulphates" "alcohol"
All methods selected the same 6 variables for the model.
#Visualizing regsubset results
old.par <- par(mfrow=c(1,3),ps=10)
for ( myMthd in names(regsubsetsAll) ) {
plot(regsubsetsAll[[myMthd]],main=myMthd)
}
par(old.par)
It looks like in this case all three methods choose the same variables for any given model size when applied to the red wine dataset.
The same conclusion can be reached by visualizing variable membership in the models in order of their size:
old.par <- par(mfrow=c(1,3),ps=10)
for ( myMthd in names(whichAll) ) {
image(1:nrow(whichAll[[myMthd]]),
1:ncol(whichAll[[myMthd]]),
whichAll[[myMthd]],xlab="N(vars)",ylab="",
xaxt="n",yaxt="n",breaks=c(-0.5,0.5,1.5),
col=c("white","gray"),main=myMthd)
axis(1,1:nrow(whichAll[[myMthd]]),rownames(whichAll[[myMthd]]))
axis(2,1:ncol(whichAll[[myMthd]]),colnames(whichAll[[myMthd]]),las=2)
}
par(old.par)
#predict() method for regsubsets objects (the leaps package provides none)
predict.regsubsets <- function(object, newdata, id, ...){
form=as.formula(object$call[[2]])
mat=model.matrix(form,newdata)
coefi=coef(object,id=id)
xvars=names(coefi)
mat[,xvars] %*% coefi
}
dfTmp <- NULL
whichSum <- array(0,dim=c(11,12,3),
dimnames=list(NULL,colnames(model.matrix(quality~.,red.wine.log)),
c("exhaustive", "backward", "forward")))
# Split data into training and test 30 times:
nTries <- 30
for ( iTry in 1:nTries ) {
bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(red.wine.log)))
# Try each method available in regsubsets
# to select the best model of each size:
for ( jSelect in c("exhaustive", "backward", "forward") ) {
rsTrain <- regsubsets(quality~.,red.wine.log[bTrain,],nvmax=11,method=jSelect)
# Add up variable selections:
whichSum[,,jSelect] <- whichSum[,,jSelect] + summary(rsTrain)$which
# Calculate test error for each set of variables
# using predict.regsubsets implemented above:
for ( kVarSet in 1:11 ) {
# make predictions:
testPred <- predict(rsTrain,red.wine.log[!bTrain,],id=kVarSet)
# calculate MSE:
mseTest <- mean((testPred-red.wine.log[!bTrain,"quality"])^2)
# add to data.frame for future plotting:
dfTmp <- rbind(dfTmp,data.frame(sim=iTry,sel=jSelect,vars=kVarSet,
mse=c(mseTest,summary(rsTrain)$rss[kVarSet]/sum(bTrain)),trainTest=c("test","train")))
}
}
}
#Mean test MSE
mean(dfTmp[dfTmp$vars == 6 & dfTmp$sel == "exhaustive" & dfTmp$trainTest == "test", "mse"])
## [1] 0.4284542
# plot MSEs by training/test, number of
# variables and selection method:
ggplot(dfTmp,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest)
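To read a single "optimal" model size off these resampling results, one can average the test MSE over the 30 splits for each number of variables. This is a sketch assuming the `dfTmp` data frame accumulated above:

```r
#Mean test MSE by number of variables (exhaustive selection)
testMSE <- aggregate(mse ~ vars,
                     data = subset(dfTmp, sel == "exhaustive" & trainTest == "test"),
                     FUN = mean)
testMSE[which.min(testMSE$mse),]   #model size with the lowest average test MSE
```

In practice the smallest size whose average test MSE is within noise of this minimum is often preferred.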
White Wine:
#Finding values for metrics("rsq","rss","adjr2","cp","bic") with three methods ("exhaustive", "backward", "forward") for different number of attributes in model
summaryMetrics <- NULL
whichAll <- list()
regsubsetsAll <- list()
for ( myMthd in c("exhaustive", "backward", "forward") ) {
rsRes <- regsubsets(quality~.,white.wine.log,method=myMthd,nvmax=10,really.big = T)
regsubsetsAll[[myMthd]] <- rsRes
summRes <- summary(rsRes)
whichAll[[myMthd]] <- summRes$which
for ( metricName in c("rsq","rss","adjr2","cp","bic") ) {
summaryMetrics <- rbind(summaryMetrics,
data.frame(method=myMthd,metric=metricName,
nvars=1:length(summRes[[metricName]]),
value=summRes[[metricName]]))
}
}
ggplot(summaryMetrics,aes(x=nvars,y=value,shape=method,colour=method)) + geom_path() + geom_point() + facet_wrap(~metric,scales="free") + theme(legend.position="top")
All three variable selection methods, when applied to the entire dataset, yield models with very similar fit metrics. For RSQ and RSS, increasing the number of variables appears to result in progressive improvement of the fit. AdjR2, BIC and CP reach their best values with 6 out of 10 variables in the model. Adding the 2nd, 3rd and 4th variables improves the model significantly; up to the 6th variable the improvement is slight.
Using 5 variables:
#Exhaustive method
white.wine.log.set5 = names(whichAll$exhaustive[5,whichAll$exhaustive[5,]==TRUE])[-1]
white.wine.log.set5
## [1] "volatile.acidity" "residual.sugar" "free.sulfur.dioxide"
## [4] "total.sulfur.dioxide" "alcohol"
#Backward method
names(whichAll$backward[5,whichAll$backward[5,]==TRUE])[-1]
## [1] "volatile.acidity" "residual.sugar" "free.sulfur.dioxide"
## [4] "total.sulfur.dioxide" "alcohol"
#Forward method
names(whichAll$forward[5,whichAll$forward[5,]==TRUE])[-1]
## [1] "volatile.acidity" "residual.sugar" "free.sulfur.dioxide"
## [4] "total.sulfur.dioxide" "alcohol"
#Visualizing regsubset results
old.par <- par(mfrow=c(1,3),ps=10)
for ( myMthd in names(regsubsetsAll) ) {
plot(regsubsetsAll[[myMthd]],main=myMthd)
}
par(old.par)
All three variable selection methods choose the same variables when applied to the white wine dataset for a given variable number.
The same conclusion can be reached by visualizing variable membership in the models in order of their size:
old.par <- par(mfrow=c(1,3),ps=10)
for ( myMthd in names(whichAll) ) {
image(1:nrow(whichAll[[myMthd]]),
1:ncol(whichAll[[myMthd]]),
whichAll[[myMthd]],xlab="N(vars)",ylab="",
xaxt="n",yaxt="n",breaks=c(-0.5,0.5,1.5),
col=c("white","gray"),main=myMthd)
axis(1,1:nrow(whichAll[[myMthd]]),rownames(whichAll[[myMthd]]))
axis(2,1:ncol(whichAll[[myMthd]]),colnames(whichAll[[myMthd]]),las=2)
}
par(old.par)
dfTmp <- NULL
whichSum <- array(0,dim=c(10,11,3),
dimnames=list(NULL,colnames(model.matrix(quality~.,white.wine.log)),
c("exhaustive", "backward", "forward")))
# Split data into training and test 30 times:
nTries <- 30
for ( iTry in 1:nTries ) {
bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(white.wine.log)))
# Try each method available in regsubsets
# to select the best model of each size:
for ( jSelect in c("exhaustive", "backward", "forward") ) {
rsTrain <- regsubsets(quality~.,white.wine.log[bTrain,],nvmax=10,method=jSelect)
# Add up variable selections:
whichSum[,,jSelect] <- whichSum[,,jSelect] + summary(rsTrain)$which
# Calculate test error for each set of variables
# using predict.regsubsets implemented above:
for ( kVarSet in 1:10 ) {
# make predictions:
testPred <- predict(rsTrain,white.wine.log[!bTrain,],id=kVarSet)
# calculate MSE:
mseTest <- mean((testPred-white.wine.log[!bTrain,"quality"])^2)
# add to data.frame for future plotting:
dfTmp <- rbind(dfTmp,data.frame(sim=iTry,sel=jSelect,vars=kVarSet,
mse=c(mseTest,summary(rsTrain)$rss[kVarSet]/sum(bTrain)),trainTest=c("test","train")))
}
}
}
#Mean test MSE
mean(dfTmp[dfTmp$vars == 5 & dfTmp$sel == "exhaustive" & dfTmp$trainTest == "test", "mse"])
## [1] 0.5662291
# plot MSEs by training/test, number of
# variables and selection method:
ggplot(dfTmp,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest)
#Comparing attributes for red and white wine types
wine.model.table = table(c(red.wine.log.set6,white.wine.log.set5))
wine.model.table
##
## alcohol chlorides free.sulfur.dioxide
## 2 1 1
## pH residual.sugar sulphates
## 1 1 1
## total.sulfur.dioxide volatile.acidity
## 2 2
These attributes are used for both wine types:
#Both models for wine types use these 3 attributes
names(wine.model.table[wine.model.table==2])
## [1] "alcohol" "total.sulfur.dioxide" "volatile.acidity"
The top three attributes for both wine types across all selection methods are “alcohol”, “volatile.acidity” and “sulphates” (ordered from 1st to 3rd).
Use cross-validation (or any other resampling strategy of your choice) to estimate test error for models with different numbers of variables. Compare and comment on the number of variables deemed optimal by resampling versus those selected by regsubsets in the previous task. Compare resulting models built separately for red and white wine data.
Red Wine:
#Cross-validation function
xvalMSEregsubsetsWine <- function(wine,nTries=30,kXval=5) {
retRes <- NULL
for ( iTry in 1:nTries ) {
xvalFolds <- sample(rep(1:kXval,length.out=nrow(wine)))
# Try each method available in regsubsets
# to select the best model of each size:
for ( jSelect in c("exhaustive", "backward", "forward") ) {
mthdTestErr2 <- NULL
mthdTrainErr2 <- NULL
mthdTestFoldErr2 <- NULL
for ( kFold in 1:kXval ) {
rsTrain <- regsubsets(quality~.,wine[xvalFolds!=kFold,],nvmax=(length(wine)-1),method=jSelect)
# Calculate test error for each set of variables
# using predict.regsubsets implemented above:
nVarTestErr2 <- NULL
nVarTrainErr2 <- NULL
for ( kVarSet in 1:(length(wine)-1) ) {
# make predictions for given number of variables and cross-validation fold:
kCoef <- coef(rsTrain,id=kVarSet)
testPred <- model.matrix(quality~.,wine[xvalFolds==kFold,])[,names(kCoef)] %*% kCoef
nVarTestErr2 <- cbind(nVarTestErr2,(testPred-wine[xvalFolds==kFold,"quality"])^2)
trainPred <- model.matrix(quality~.,wine[xvalFolds!=kFold,])[,names(kCoef)] %*% kCoef
nVarTrainErr2 <- cbind(nVarTrainErr2,(trainPred-wine[xvalFolds!=kFold,"quality"])^2)
}
# accumulate training and test errors over all cross-validation folds:
mthdTestErr2 <- rbind(mthdTestErr2,nVarTestErr2)
mthdTestFoldErr2 <- rbind(mthdTestFoldErr2,colMeans(nVarTestErr2))
mthdTrainErr2 <- rbind(mthdTrainErr2,nVarTrainErr2)
}
#cat(kXval,iTry,jSelect,dim(mthdTrainErr2),dim(mthdTestErr2),fill=TRUE)
# add to data.frame for future plotting:
retRes <- rbind(retRes,
data.frame(sim=iTry,sel=jSelect,vars=1:ncol(nVarTrainErr2),mse=colMeans(mthdTrainErr2),trainTest="train"),
data.frame(sim=iTry,sel=jSelect,vars=1:ncol(nVarTrainErr2),mse=colMeans(mthdTestErr2),trainTest="test"))
#for ( iRow in 1:nrow(mthdTestFoldErr2) ) {
# retRes <- rbind(retRes,data.frame(sim=iTry,sel=jSelect,vars=1:ncol(mthdTestFoldErr2),mse=mthdTestFoldErr2[iRow,],trainTest="testFold"))
#}
}
}
retRes
}
#Plotting MSEs by training/test, number of variables:
dfTmp <- rbind(data.frame(xvalMSEregsubsetsWine(red.wine.log,30,kXval=2),xval="2-fold"),
data.frame(xvalMSEregsubsetsWine(red.wine.log,30,kXval=5),xval="5-fold"),
data.frame(xvalMSEregsubsetsWine(red.wine.log,30,kXval=10),xval="10-fold"))
ggplot(dfTmp,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest+xval)
The plot shows that adding the 2nd and 3rd attributes decreases MSE significantly. The three selection methods yield models of very comparable performance. Variability of the test error is higher than that of the training error. A model with 3 or 4 variables could be justified.
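The visual impression can be backed up numerically by averaging the test MSE over simulations for each model size (a quick sketch using the dfTmp data frame assembled above):

```r
# Mean cross-validated test MSE by number of variables, averaged over
# simulations, selection methods and fold counts:
aggregate(mse ~ vars, data = dfTmp[dfTmp$trainTest == "test", ], FUN = mean)
```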
Detecting best 3 attributes based on CV results:
#Listing attributes of the optimal model
rs.red.wine.log=regsubsets(quality~.,red.wine.log, nvmax=11)
rs.red.wine.summary <- as.data.frame(summary(rs.red.wine.log)$outmat)
rs.red.tmp=rs.red.wine.summary[3,]
red.wine.log.rs3 = names(rs.red.tmp[,rs.red.tmp[nrow(rs.red.tmp),]=="*",drop=FALSE])
red.wine.log.rs3
## [1] "volatile.acidity" "sulphates" "alcohol"
#Both regsubsets and CV use these 3 attributes in the model
red.wine.set.rs.table = table(c(red.wine.log.set6,red.wine.log.rs3))
names(red.wine.set.rs.table[red.wine.set.rs.table==2])
## [1] "alcohol" "sulphates" "volatile.acidity"
red.wine.lm.rs3 = lm(quality~.,red.wine.log[,c(red.wine.log.rs3, "quality")])
red.wine.lm.set6 = lm(quality~.,red.wine.log[,c(red.wine.log.set6, "quality")])
#red.wine.lm.rs3.mse <- function(red.wine.lm.rs3) mean(red.wine.lm.rs3$residuals^2)
#red.wine.lm.rs3.mse
#red.wine.lm.set6.mse <- function(red.wine.lm.set6) mean(red.wine.lm.set6$residuals^2)
#red.wine.lm.set6.mse
#MSE of the optimal model with CV
mean((red.wine.log[,"quality"] - predict(red.wine.lm.rs3))^2)
## [1] 0.4300604
#MSE of the optimal model with regsubsets
mean((red.wine.log[,"quality"] - predict(red.wine.lm.set6))^2)
## [1] 0.4190814
#Summarizing CV model
summary(red.wine.lm.rs3)
##
## Call:
## lm(formula = quality ~ ., data = red.wine.log[, c(red.wine.log.rs3,
## "quality")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.74451 -0.37492 -0.05681 0.47051 2.14453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.0291 0.4763 -6.359 2.65e-10 ***
## volatile.acidity -1.8406 0.1518 -12.127 < 2e-16 ***
## sulphates 1.3830 0.1830 7.557 6.93e-14 ***
## alcohol 3.5945 0.1862 19.300 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6566 on 1595 degrees of freedom
## Multiple R-squared: 0.3402, Adjusted R-squared: 0.3389
## F-statistic: 274.1 on 3 and 1595 DF, p-value: < 2.2e-16
#Summarizing regsubset model
summary(red.wine.lm.set6)
##
## Call:
## lm(formula = quality ~ ., data = red.wine.log[, c(red.wine.log.set6,
## "quality")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6083 -0.3535 -0.0511 0.4602 1.9028
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.26760 0.80716 0.332 0.740283
## volatile.acidity -1.56577 0.15782 -9.921 < 2e-16 ***
## chlorides -2.38223 0.47878 -4.976 7.21e-07 ***
## total.sulfur.dioxide -0.08178 0.02465 -3.317 0.000930 ***
## pH -1.76186 0.50136 -3.514 0.000453 ***
## sulphates 1.74939 0.19874 8.802 < 2e-16 ***
## alcohol 3.37798 0.20138 16.774 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6488 on 1592 degrees of freedom
## Multiple R-squared: 0.357, Adjusted R-squared: 0.3546
## F-statistic: 147.3 on 6 and 1592 DF, p-value: < 2.2e-16
Both models return comparable results, with RSE of about 0.65 on roughly 1592 degrees of freedom and adjusted R² of about 0.35 for both models.
The coefficient values differ between the two models, which may be due to hidden collinearity among the predictors.
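A quick way to check for such collinearity is to inspect pairwise correlations among the selected predictors (a minimal sketch using objects defined above; the optional vif() call assumes the car package is installed):

```r
# Pairwise correlations among the six predictors chosen by regsubsets;
# strong off-diagonal values help explain shifting coefficient estimates.
round(cor(red.wine.log[, red.wine.log.set6]), 2)
# Variance inflation factors offer a more direct diagnostic
# (assumes the car package is available):
# car::vif(red.wine.lm.set6)
```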
White Wine:
#Plotting MSEs by training/test, number of variables:
dfTmp <- rbind(data.frame(xvalMSEregsubsetsWine(white.wine.log,30,kXval=2),xval="2-fold"),
data.frame(xvalMSEregsubsetsWine(white.wine.log,30,kXval=5),xval="5-fold"),
data.frame(xvalMSEregsubsetsWine(white.wine.log,30,kXval=10),xval="10-fold"))
ggplot(dfTmp,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest+xval)
As for red wine, the plot shows that adding the 2nd and 3rd attributes decreases MSE significantly, the three selection methods yield models of very comparable performance, and the variability of the test error is higher than that of the training error. A model with 3 or 4 variables could be justified.
Detecting best 3 attributes based on CV results:
#Listing attributes of the optimal model
rs.white.wine.log=regsubsets(quality~.,white.wine.log, nvmax=11)
rs.white.wine.summary <- as.data.frame(summary(rs.white.wine.log)$outmat)
rs.white.tmp=rs.white.wine.summary[3,]
white.wine.log.rs3 = names(rs.white.tmp[,rs.white.tmp[nrow(rs.white.tmp),]=="*",drop=FALSE])
white.wine.log.rs3
## [1] "volatile.acidity" "free.sulfur.dioxide" "alcohol"
white.wine.set.rs.table = table(c(white.wine.log.set5,white.wine.log.rs3))
names(white.wine.set.rs.table[white.wine.set.rs.table==2])
## [1] "alcohol" "free.sulfur.dioxide" "volatile.acidity"
white.wine.lm.rs3 = lm(quality~.,white.wine.log[,c(white.wine.log.rs3, "quality")])
white.wine.lm.set5 = lm(quality~.,white.wine.log[,c(white.wine.log.set5, "quality")])
#white.wine.lm.rs3.mse <- function(white.wine.lm.rs3) mean(white.wine.lm.rs3$residuals^2)
#white.wine.lm.rs3.mse
#white.wine.lm.set5.mse <- function(white.wine.lm.set5) mean(white.wine.lm.set5$residuals^2)
#white.wine.lm.set5.mse
#MSE of the optimal model with CV
mean((white.wine.log[,"quality"] - predict(white.wine.lm.rs3))^2)
## [1] 0.5746447
#MSE of the optimal model with regsubsets
mean((white.wine.log[,"quality"] - predict(white.wine.lm.set5))^2)
## [1] 0.561256
#Summarizing CV model
summary(white.wine.lm.rs3)
##
## Call:
## lm(formula = quality ~ ., data = white.wine.log[, c(white.wine.log.rs3,
## "quality")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6117 -0.4986 -0.0470 0.4744 3.1645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.61910 0.28708 -16.09 <2e-16 ***
## volatile.acidity -2.38976 0.14510 -16.47 <2e-16 ***
## free.sulfur.dioxide 0.31334 0.02147 14.60 <2e-16 ***
## alcohol 4.09708 0.10603 38.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7584 on 4894 degrees of freedom
## Multiple R-squared: 0.2672, Adjusted R-squared: 0.2668
## F-statistic: 594.9 on 3 and 4894 DF, p-value: < 2.2e-16
#Summarizing regsubset model
summary(white.wine.lm.set5)
##
## Call:
## lm(formula = quality ~ ., data = white.wine.log[, c(white.wine.log.set5,
## "quality")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5452 -0.5005 -0.0362 0.4571 3.0220
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.9905752 0.3177966 -15.704 < 2e-16 ***
## volatile.acidity -2.3813164 0.1489658 -15.986 < 2e-16 ***
## residual.sugar 0.1634996 0.0173281 9.435 < 2e-16 ***
## free.sulfur.dioxide 0.3580011 0.0266641 13.426 < 2e-16 ***
## total.sulfur.dioxide -0.0023278 0.0003581 -6.500 8.83e-11 ***
## alcohol 4.1995906 0.1215869 34.540 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7496 on 4892 degrees of freedom
## Multiple R-squared: 0.2843, Adjusted R-squared: 0.2836
## F-statistic: 388.6 on 5 and 4892 DF, p-value: < 2.2e-16
Both models return comparable results, with RSE of about 0.75 on roughly 4890 degrees of freedom and adjusted R² of about 0.27 for both models.
The coefficients of the two models are comparable.
Use regularized approaches (i.e. lasso and ridge) to model quality of red and white wine (separately). Compare resulting models (in terms of number of variables and their effects) to those selected in the previous two tasks (by regsubsets and resampling), comment on differences and similarities among them.
Red Wine:
Lasso
x <- model.matrix(quality~.,red.wine.log)[,-1]
y <- red.wine.log[,"quality"]
lassoRes <- glmnet(x,y,alpha=1)
#plot(lassoRes)
#head(x)
#head(y)
#Fit lasso over the default sequence of λ values
cvLassoRes <- cv.glmnet(x,y,alpha=1)
plot(cvLassoRes)
#cvLassoRes <- cv.glmnet(x,y,alpha=1,lambda=10^((-200:20)/80))
#plot(cvLassoRes)
predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.min)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 38.23272932
## fixed.acidity 0.45110221
## volatile.acidity -1.68217813
## citric.acid -0.26814100
## residual.sugar 0.09214019
## chlorides -2.04553674
## free.sulfur.dioxide 0.10282436
## total.sulfur.dioxide -0.14706971
## density -56.59623090
## pH -1.23353127
## sulphates 1.81496699
## alcohol 3.09630318
predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.5380136
## fixed.acidity .
## volatile.acidity -1.4827296
## citric.acid .
## residual.sugar .
## chlorides .
## free.sulfur.dioxide .
## total.sulfur.dioxide .
## density .
## pH .
## sulphates 0.8686966
## alcohol 3.0258839
#With λ(1SE), lasso retains only three variables: volatile.acidity, sulphates and alcohol
lassoResScaled <- glmnet(scale(x),y,alpha=1)
cvLassoResScaled <- cv.glmnet(scale(x),y,alpha=1)
predict(lassoResScaled,type="coefficients",s=cvLassoResScaled$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 5.636022514
## fixed.acidity .
## volatile.acidity -0.173858630
## citric.acid .
## residual.sugar .
## chlorides .
## free.sulfur.dioxide .
## total.sulfur.dioxide -0.003127056
## density .
## pH .
## sulphates 0.086027559
## alcohol 0.277475428
Lasso on red wine returned results comparable to the 6-variable regsubsets model, with the difference that “fixed.acidity” enters the model instead of “pH”. Test MSE is slightly higher: 0.45 (vs. 0.43). The three variables chosen by resampling (“alcohol”, “sulphates”, “volatile.acidity”) have the three largest coefficients in the scaled lasso fit.
lassoCoefCnt <- 0
lassoMSE <- NULL
for ( iTry in 1:30 ) {
bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(x)))
cvLassoTrain <- cv.glmnet(x[bTrain,],y[bTrain],alpha=1,lambda=10^((-120:0)/20))
lassoTrain <- glmnet(x[bTrain,],y[bTrain],alpha=1,lambda=10^((-120:0)/20))
lassoTrainCoef <- predict(lassoTrain,type="coefficients",s=cvLassoTrain$lambda.1se)
lassoCoefCnt <- lassoCoefCnt + (lassoTrainCoef[-1,1]!=0)
lassoTestPred <- predict(lassoTrain,newx=x[!bTrain,],s=cvLassoTrain$lambda.1se)
lassoMSE <- c(lassoMSE,mean((lassoTestPred-y[!bTrain])^2))
}
mean(lassoMSE)
## [1] 0.453249
quantile(lassoMSE)
## 0% 25% 50% 75% 100%
## 0.4196183 0.4377595 0.4544767 0.4642479 0.4999931
lassoCoefCnt
## fixed.acidity volatile.acidity citric.acid
## 6 30 1
## residual.sugar chlorides free.sulfur.dioxide
## 0 4 0
## total.sulfur.dioxide density pH
## 8 0 4
## sulphates alcohol
## 30 30
Ridge
ridgeRes <- glmnet(x,y,alpha=0)
plot(ridgeRes)
cvRidgeRes <- cv.glmnet(x,y,alpha=0)
plot(cvRidgeRes)
cvRidgeRes$lambda.min
## [1] 0.0422638
cvRidgeRes$lambda.1se
## [1] 0.4325831
#With default λ’s the lowest MSE is attained for the least regularized model (for the lowest λ)
#cvRidgeRes <- cv.glmnet(x,y,alpha=0,lambda=10^((-50:60)/20))
#plot(cvRidgeRes)
#cvRidgeRes$lambda.min
#cvRidgeRes$lambda.1se
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.min)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 47.06938544
## fixed.acidity 0.43923453
## volatile.acidity -1.57485179
## citric.acid -0.16220682
## residual.sugar 0.10269990
## chlorides -2.04327675
## free.sulfur.dioxide 0.08804357
## total.sulfur.dioxide -0.13695454
## density -69.16741864
## pH -1.00235782
## sulphates 1.76033582
## alcohol 2.88585157
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 46.30667019
## fixed.acidity 0.23475300
## volatile.acidity -1.14508315
## citric.acid 0.15464683
## residual.sugar 0.06651849
## chlorides -1.56800056
## free.sulfur.dioxide 0.02536607
## total.sulfur.dioxide -0.08589000
## density -65.73562003
## pH -0.46594150
## sulphates 1.25020851
## alcohol 2.07731031
#As expected, for more regularized model (using 1SE rule) coefficients are smaller by absolute value than those at the minimum of MSE
ridgeResScaled <- glmnet(scale(x),y,alpha=0)
cvRidgeResScaled <- cv.glmnet(scale(x),y,alpha=0,lambda=10^((-50:60)/20))
predict(ridgeResScaled,type="coefficients",s=cvRidgeResScaled$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 5.63602251
## fixed.acidity 0.03725670
## volatile.acidity -0.12270267
## citric.acid 0.03421604
## residual.sugar 0.01587369
## chlorides -0.05614895
## free.sulfur.dioxide 0.01134691
## total.sulfur.dioxide -0.05360908
## density -0.05890823
## pH -0.01458087
## sulphates 0.10722510
## alcohol 0.17333858
The top 4 variables are also selected by regsubsets. The three variables chosen by resampling (“alcohol”, “sulphates”, “volatile.acidity”) also have the largest coefficients in the scaled ridge fit. The intercepts for the lasso and ridge fits are comparable, and the test MSE is comparable to that of lasso.
ridgeCoefCnt <- 0
ridgeCoefAve <- 0
ridgeMSE <- NULL
for ( iTry in 1:30 ) {
bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(x)))
cvridgeTrain <- cv.glmnet(x[bTrain,],y[bTrain],alpha=0,lambda=10^((-50:50)/20))
ridgeTrain <- glmnet(x[bTrain,],y[bTrain],alpha=0,lambda=10^((-50:50)/20))
ridgeTrainCoef <- predict(ridgeTrain,type="coefficients",s=cvridgeTrain$lambda.1se)
ridgeCoefCnt <- ridgeCoefCnt + (ridgeTrainCoef[-1,1]!=0)
ridgeCoefAve <- ridgeCoefAve + ridgeTrainCoef[-1,1]
ridgeTestPred <- predict(ridgeTrain,newx=x[!bTrain,],s=cvridgeTrain$lambda.1se)
ridgeMSE <- c(ridgeMSE,mean((ridgeTestPred-y[!bTrain])^2))
}
ridgeCoefAve <- ridgeCoefAve / length(ridgeMSE)
ridgeCoefAve
## fixed.acidity volatile.acidity citric.acid
## 0.22034365 -1.06909608 0.17773746
## residual.sugar chlorides free.sulfur.dioxide
## 0.06716620 -1.51888641 0.02071008
## total.sulfur.dioxide density pH
## -0.07601992 -60.13585517 -0.47194907
## sulphates alcohol
## 1.15435841 1.92691904
mean(ridgeMSE)
## [1] 0.4500541
quantile(ridgeMSE)
## 0% 25% 50% 75% 100%
## 0.4002306 0.4393771 0.4516351 0.4596842 0.4903137
White Wine:
Lasso
x <- model.matrix(quality~.,white.wine.log)[,-1]
y <- white.wine.log[,"quality"]
lassoRes <- glmnet(x,y,alpha=1)
plot(lassoRes)
#Fit lasso over the default sequence of λ values
cvLassoRes <- cv.glmnet(x,y,alpha=1)
#plot(cvLassoRes)
#cvLassoRes <- cv.glmnet(x,y,alpha=1,lambda=10^((-200:20)/80))
#plot(cvLassoRes)
predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.min)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -5.116687416
## fixed.acidity -0.286411920
## volatile.acidity -2.309866552
## citric.acid 0.046307121
## residual.sugar 0.171507922
## chlorides -1.194220914
## free.sulfur.dioxide 0.346725554
## total.sulfur.dioxide -0.002465644
## pH 0.635859902
## sulphates 0.670474810
## alcohol 4.037978134
predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -3.3012462508
## fixed.acidity -0.1490326911
## volatile.acidity -2.0088516206
## citric.acid .
## residual.sugar 0.0717210982
## chlorides -0.5729267959
## free.sulfur.dioxide 0.2246102553
## total.sulfur.dioxide -0.0004180048
## pH .
## sulphates 0.1333045962
## alcohol 3.7316504071
#With λ(1SE), lasso drops only citric.acid and pH, retaining the remaining eight variables
lassoResScaled <- glmnet(scale(x),y,alpha=1)
cvLassoResScaled <- cv.glmnet(scale(x),y,alpha=1)
predict(lassoResScaled,type="coefficients",s=cvLassoResScaled$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 5.87790935
## fixed.acidity -0.01983232
## volatile.acidity -0.15534417
## citric.acid .
## residual.sugar 0.06387632
## chlorides -0.01378313
## free.sulfur.dioxide 0.12817935
## total.sulfur.dioxide -0.03229294
## pH .
## sulphates 0.01746241
## alcohol 0.39811840
Lasso on white wine returned results comparable to the 5-variable regsubsets model, with the difference that “chlorides” and “sulphates” are additionally included by lasso, although with small coefficients. Test MSE is slightly higher: 0.57 (vs. 0.56). The three variables chosen by resampling (“volatile.acidity”, “free.sulfur.dioxide”, “alcohol”) have the three largest coefficients in the scaled lasso fit.
lassoCoefCnt <- 0
lassoMSE <- NULL
for ( iTry in 1:30 ) {
bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(x)))
cvLassoTrain <- cv.glmnet(x[bTrain,],y[bTrain],alpha=1,lambda=10^((-120:0)/20))
lassoTrain <- glmnet(x[bTrain,],y[bTrain],alpha=1,lambda=10^((-120:0)/20))
lassoTrainCoef <- predict(lassoTrain,type="coefficients",s=cvLassoTrain$lambda.1se)
lassoCoefCnt <- lassoCoefCnt + (lassoTrainCoef[-1,1]!=0)
lassoTestPred <- predict(lassoTrain,newx=x[!bTrain,],s=cvLassoTrain$lambda.1se)
lassoMSE <- c(lassoMSE,mean((lassoTestPred-y[!bTrain])^2))
}
mean(lassoMSE)
## [1] 0.5762332
quantile(lassoMSE)
## 0% 25% 50% 75% 100%
## 0.5567583 0.5704082 0.5758174 0.5824260 0.6125151
lassoCoefCnt
## fixed.acidity volatile.acidity citric.acid
## 22 30 0
## residual.sugar chlorides free.sulfur.dioxide
## 30 24 30
## total.sulfur.dioxide pH sulphates
## 10 6 14
## alcohol
## 30
Ridge
ridgeRes <- glmnet(x,y,alpha=0)
plot(ridgeRes)
cvRidgeRes <- cv.glmnet(x,y,alpha=0)
plot(cvRidgeRes)
cvRidgeRes$lambda.min
## [1] 0.04197234
cvRidgeRes$lambda.1se
## [1] 0.2698013
#With default λ’s the lowest MSE is attained for the least regularized model (for the lowest λ)
#cvRidgeRes <- cv.glmnet(x,y,alpha=0,lambda=10^((-50:60)/20))
#plot(cvRidgeRes)
#cvRidgeRes$lambda.min
#cvRidgeRes$lambda.1se
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.min)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -4.31557286
## fixed.acidity -0.31502445
## volatile.acidity -2.18189589
## citric.acid 0.09055180
## residual.sugar 0.15061177
## chlorides -1.68626410
## free.sulfur.dioxide 0.32862652
## total.sulfur.dioxide -0.00237491
## pH 0.64148161
## sulphates 0.64097515
## alcohol 3.76173901
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.500024080
## fixed.acidity -0.345275490
## volatile.acidity -1.698317641
## citric.acid 0.142207966
## residual.sugar 0.077099496
## chlorides -2.797200270
## free.sulfur.dioxide 0.242778921
## total.sulfur.dioxide -0.001816626
## pH 0.620833471
## sulphates 0.481160813
## alcohol 2.780037603
#As expected, for more regularized model (using 1SE rule) coefficients are smaller by absolute value than those at the minimum of MSE
ridgeResScaled <- glmnet(scale(x),y,alpha=0)
cvRidgeResScaled <- cv.glmnet(scale(x),y,alpha=0,lambda=10^((-50:60)/20))
predict(ridgeResScaled,type="coefficients",s=cvRidgeResScaled$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 5.87790935
## fixed.acidity -0.03654364
## volatile.acidity -0.13674338
## citric.acid 0.01197059
## residual.sugar 0.06739607
## chlorides -0.05201741
## free.sulfur.dioxide 0.13791681
## total.sulfur.dioxide -0.08281277
## pH 0.02223374
## sulphates 0.03872091
## alcohol 0.31683952
The top 5 variables in ridge are also selected by regsubsets. The three variables chosen by resampling (“alcohol”, “free.sulfur.dioxide”, “volatile.acidity”) also have the largest coefficients in the scaled ridge fit. The test MSE is comparable to that of lasso.
ridgeCoefCnt <- 0
ridgeCoefAve <- 0
ridgeMSE <- NULL
for ( iTry in 1:30 ) {
bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(x)))
cvridgeTrain <- cv.glmnet(x[bTrain,],y[bTrain],alpha=0,lambda=10^((-50:50)/20))
ridgeTrain <- glmnet(x[bTrain,],y[bTrain],alpha=0,lambda=10^((-50:50)/20))
ridgeTrainCoef <- predict(ridgeTrain,type="coefficients",s=cvridgeTrain$lambda.1se)
ridgeCoefCnt <- ridgeCoefCnt + (ridgeTrainCoef[-1,1]!=0)
ridgeCoefAve <- ridgeCoefAve + ridgeTrainCoef[-1,1]
ridgeTestPred <- predict(ridgeTrain,newx=x[!bTrain,],s=cvridgeTrain$lambda.1se)
ridgeMSE <- c(ridgeMSE,mean((ridgeTestPred-y[!bTrain])^2))
}
ridgeCoefAve <- ridgeCoefAve / length(ridgeMSE)
ridgeCoefAve
## fixed.acidity volatile.acidity citric.acid
## -0.343210096 -1.672566887 0.145148960
## residual.sugar chlorides free.sulfur.dioxide
## 0.077962901 -2.816598544 0.234733434
## total.sulfur.dioxide pH sulphates
## -0.001770157 0.596507682 0.487427508
## alcohol
## 2.751838813
mean(ridgeMSE)
## [1] 0.5780262
quantile(ridgeMSE)
## 0% 25% 50% 75% 100%
## 0.5468051 0.5664662 0.5769768 0.5901110 0.6071624
Merge data for red and white wine (function rbind allows merging of two matrices/data frames with the same number of columns) and plot data projection to the first two principal components (e.g. biplot or similar plots). Does this representation suggest presence of clustering structure in the data? Does wine type (i.e. red or white) or quality appear to be associated with different regions occupied by observations in the plot? Please remember not to include quality attribute or wine type (red or white) indicator in your merged data, otherwise, apparent association of quality or wine type with PCA layout will be influenced by presence of those indicators in your data.
#Set darker color and larger point size for higher wine quality to aid visualisation; red color denotes red wine in the plot.
red.wine.tmp = red.wine
white.wine.tmp = white.wine
red.wine.tmp$color = 0
white.wine.tmp$color = 50/360
wine = rbind(red.wine.tmp,white.wine.tmp)
wine = wine[order(wine$quality),]
wine.inp = wine[,c(-12,-13)]
pc.wine = prcomp(wine.inp,scale=TRUE)
pc.wine.pc12 = pc.wine$x[,1:2]
plot(pc.wine.pc12,pch=16,cex=wine$quality^4/5400,xlim=c(-5,6.5),ylim=c(-6,7.5),col=hsv(wine$color, 1, 1-wine$quality/12))
arrows(0,0, pc.wine$rotation[,"PC1"]*12,pc.wine$rotation[,"PC2"]*12,length=0.1, lwd=1,angle=20, col="red")
text(pc.wine$rotation[,"PC1"]*13,pc.wine$rotation[,"PC2"]*13,rownames(pc.wine$rotation), col="red", cex=.4)
#Bigger circles mean better quality
#Red points denote red wine
This biplot shows a clear clustering structure separating the two wine types. Red wines tend to have higher-than-average “volatile.acidity”, “sulphates”, “chlorides” and “fixed.acidity” compared to white wines, whereas white wines tend to have higher-than-average “total.sulfur.dioxide”, “free.sulfur.dioxide” and “residual.sugar”. Generally, wines of higher quality have average or above-average alcohol.
The loading arrows for “free.sulfur.dioxide” and “total.sulfur.dioxide” point in nearly the same direction, indicating that they are highly correlated.
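This correlation claim is easy to verify directly on the merged data (a quick check using the wine.inp data frame defined above):

```r
# Correlation between free and total sulfur dioxide in the merged data:
cor(wine.inp$free.sulfur.dioxide, wine.inp$total.sulfur.dioxide)
```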
Compute PCA representation of the data for one of the wine types (red or white) excluding wine quality attribute (of course!). Use resulting principal components (slot x in the output of prcomp) as new predictors to fit a linear model of wine quality as a function of these predictors. Compare resulting fit (in terms of MSE, r-squared, etc.) to those obtained above. Comment on the differences and similarities between these fits.
pc.red.wine.log = prcomp(red.wine.log[,-12],scale=TRUE)
pc.red.wine.log.pc12 = pc.red.wine.log$x[,1:2]
plot(pc.red.wine.log.pc12,pch=16,cex=red.wine.tmp$quality^4/2500,xlim=c(-5.5,6.5),ylim=c(-6,7.5),col=hsv(red.wine.tmp$color, 1, 1-red.wine.tmp$quality/12))
arrows(0,0, pc.red.wine.log$rotation[,"PC1"]*12,pc.red.wine.log$rotation[,"PC2"]*12,length=0.1, lwd=1,angle=20, col="red")
text(pc.red.wine.log$rotation[,"PC1"]*13,pc.red.wine.log$rotation[,"PC2"]*13,rownames(pc.red.wine.log$rotation), col="red", cex=.4)
#pc1.wine = pc.red.wine.log$rotation[,"PC1"]; pc1.wine[which.max(abs(pc1.wine))]
#pc2.wine = pc.red.wine.log$rotation[,"PC2"]; pc2.wine[which.max(abs(pc2.wine))]
pc.red.wine.log.var = (pc.red.wine.log$sdev^2)[1:11]
pve.red.wine.log.var = pc.red.wine.log.var/sum(pc.red.wine.log.var)
old.par <- par(mfrow=c(1,2),ps=10)
plot(pve.red.wine.log.var, main="Standardized", xlab="Principal Component", ylab="Proportion of Variance Explained ", ylim=c(0,1),type="b")
plot(cumsum(pve.red.wine.log.var), main="Standardized", xlab="Principal Component ", ylab=" Cumulative Proportion of Variance Explained ", ylim=c(0,1), type='b')
par(old.par)
pcr.red.wine.log=pcr(quality~.,data=red.wine.log,validation="CV",scale=TRUE)
summary(pcr.red.wine.log)
## Data: X dimension: 1599 11
## Y dimension: 1599 1
## Fit method: svdpc
## Number of components considered: 11
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 0.8078 0.8032 0.7469 0.6676 0.6658 0.6618 0.6622
## adjCV 0.8078 0.8031 0.7466 0.6675 0.6657 0.6617 0.6621
## 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 0.6568 0.6555 0.6489 0.6493 0.6496
## adjCV 0.6566 0.6553 0.6487 0.6490 0.6493
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 28.593 46.93 61.23 71.98 81.14 87.12 92.20
## quality 1.327 14.98 32.00 32.42 33.28 33.33 34.52
## 8 comps 9 comps 10 comps 11 comps
## X 95.77 98.05 99.51 100.0
## quality 34.93 36.26 36.30 36.3
validationplot(pcr.red.wine.log,val.type = "MSEP")
x <- model.matrix(quality~.,red.wine.log)[,-1]
y <- red.wine.log[,"quality"]
set.seed (1)
train=sample(1:nrow(x), nrow(x)/2)
test=(-train)
y.test=y[test]
set.seed(1)
pcr.red.wine.log=pcr(quality~.,data=red.wine.log,subset=train,validation="CV",scale=TRUE)
old.par <- par(mfrow=c(1,2),ps=10)
validationplot(pcr.red.wine.log,val.type="MSEP")
validationplot(pcr.red.wine.log,val.type="R2")
par(old.par)
pcr.red.wine.log.pred=predict(pcr.red.wine.log,x[test,],ncomp=6)
mean((pcr.red.wine.log.pred-y.test)^2)
## [1] 0.4045485
pcr.red.wine.log=pcr(y~x,scale=TRUE,ncomp=6)
summary(pcr.red.wine.log)
## Data: X dimension: 1599 11
## Y dimension: 1599 1
## Fit method: svdpc
## Number of components considered: 6
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 28.593 46.93 61.23 71.98 81.14 87.12
## y 1.327 14.98 32.00 32.42 33.28 33.33
Six components explain 87% of the variance in the predictors. The PCR fit's test MSE (0.40) and explained quality variance (about 33% at 6 components) are both slightly lower than those of the full linear models above, but still comparable.
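For a like-for-like comparison, the training R² of the 6-component PCR fit can be set against that of the full least-squares model on the same data (a sketch using red.wine.log from above):

```r
# Training R^2 of the ordinary least squares fit on all predictors,
# for comparison with the ~33% explained by the 6-component PCR fit:
summary(lm(quality ~ ., data = red.wine.log))$r.squared
```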